[57] ABSTRACT

A RAID-compatible data storage system which allows incremental increases in storage capacity at a cost that is proportional to the increase in capacity. The system does not require changes to the host system. The control and interface functions previously performed by a single (or redundant) central data storage device controller are distributed among a number of modular control units (MCUs). Each MCU is preferably physically coupled to a data storage device to form a basic, low-cost integrated storage node. One of two bus ports interfaces an MCU with the host computer on a host bus, and the other bus port interfaces an MCU with one or more data storage devices coupled to the MCU by a data storage device bus. The serial interface ports provide a means by which each of the MCUs may communicate with each other MCU to facilitate the implementation of a memory array architecture. The entire data storage array may appear as a single device capable of responding to a single identification number on the host bus, or may appear as a number of independent devices. A controlling MCU receives a command and notifies the other MCUs that are involved in a read or write operation. Control of the host bus is transferred from one MCU to the next MCU in sequence so that the data is received by the host computer, or written to each data storage device, in the proper order.
DISTRIBUTED DISK ARRAY ARCHITECTURE

This is a continuation of application Ser. No. 08/415,157 filed Mar. 31, 1995, patented U.S. Pat. No. 5,689,678, which is a continuation of application Ser. No. 08/029,794 filed Mar. 11, 1993, now abandoned.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to data storage systems, and more particularly to a method and apparatus for storing data on multiple redundant data storage devices.

2. Description of Related Art

As computer use increases, data storage needs have increased even more. In an attempt to provide large amounts of data storage that is both inexpensive and reliable, it is becoming increasingly common to use large numbers of small, inexpensive data storage devices which work in unison to make available a reliable large data storage capacity. In a paper entitled "A Case for Redundant Arrays of Inexpensive Disks (RAID)", Patterson et al., Proc. ACM SIGMOD, June 1988, the University of California at Berkeley has catalogued a set of concepts to address the problems of pooling multiple small data storage devices. The Patterson reference characterizes arrays of disk drives in one of five architectures under the acronym "RAID".

A RAID 1 architecture involves providing a duplicate set of "mirror" storage units and keeping a duplicate copy of all data on each pair of storage units. While such a solution solves the reliability problem, it doubles the cost of storage. A number of implementations of RAID 1 architectures have been made, in particular by Tandem Corporation.

A RAID 2 architecture stores each bit of each word of data, plus Error Detection and Correction (EDC) bits for each word, on separate disk drives. For example, U.S. Pat. No. 4,722,085 to Flora et al. discloses a disk drive memory using a plurality of relatively small, independently operating disk subsystems to function as a large, high capacity disk drive having an unusually high fault tolerance and a very high data transfer bandwidth. A data organizer adds 7 EDC bits (determined using the well-known Hamming code) to each 32-bit data word to provide error detection and error correction capability. The resultant 39-bit word is written, one bit per disk drive, on to 39 disk drives. If one of the 39 disk drives fails, the remaining 38 bits of each stored 39-bit word can be used to reconstruct each 32-bit data word on a word-by-word basis as each data word is read from the disk drives, thereby obtaining fault tolerance.

An obvious drawback of such a system is the large number of disk drives required for a minimum system (since most large computers use a 32-bit word), and the relatively high ratio of drives required to store the EDC bits (7 drives out of 39). A further limitation of a RAID 2 disk drive memory system is that the individual disk actuators are operated in unison to write each data block, the bits of which are distributed over all of the disk drives. This arrangement has a high data transfer bandwidth, since each individual disk transfers part of a block of data, the net effect being that the entire block is available to the computer system much faster than if a single drive were accessing the block. This is advantageous for large data blocks. However, this arrangement effectively provides only a single read/write head actuator for the entire storage unit. This adversely affects the random access performance of the drive array when data files are small, since only one data file at a time can be accessed by the "single" actuator. Thus, RAID 2 systems are generally not considered to be suitable for computer systems designed for On-Line Transaction Processing (OLTP), such as in banking, financial, and reservation systems, where a large number of random accesses to many small data files comprises the bulk of data storage and transfer operations.

A RAID 3 architecture is based on the concept that each disk drive storage unit has internal means for detecting a fault or data error. Therefore, it is not necessary to store extra information to detect the location of an error; a simpler form of parity-based error correction can thus be used. In this approach, the contents of all storage units subject to failure are "Exclusive OR'd" (XOR'd) to generate parity information. The resulting parity information is stored in a single redundant storage unit. If a storage unit fails, the data on that unit can be reconstructed onto a replacement storage unit by XOR'ing the data from the remaining storage units with the parity information. Such an arrangement has the advantage over the mirrored disk RAID 1 architecture in that only one additional storage unit is required for "N" storage units. A further aspect of the RAID 3 architecture is that the disk drives are operated in a coupled manner, similar to a RAID 2 system, and a single disk drive is designated as the parity unit.

One implementation of a RAID 3 architecture is the Micropolis Corporation Parallel Drive Array, Model 184 SCSI, that uses four parallel, synchronized disk drives and one redundant parity drive. The failure of one of the four data disk drives can be remedied by the use of the parity bits stored on the parity disk drive. Another example of a RAID 3 system is described in a U.S. patent to Ouchi.

A RAID 3 disk drive memory system has a much lower ratio of redundancy units to data units than a RAID 2 system. However, a RAID 3 system has the same performance limitation as a RAID 2 system, in that the individual disk actuators are coupled, operating in unison. This adversely affects the random access performance of the drive array when data files are small, since only one data file at a time can be accessed by the "single" actuator. Thus, RAID 3 systems are generally not considered to be suitable for computer systems designed for OLTP purposes.

A RAID 4 architecture uses the same parity error correction concept of the RAID 3 architecture, but improves on the performance of a RAID 3 system with respect to random reading of small files by "uncoupling" the operation of the individual disk drive actuators, and reading and writing a larger minimum amount of data (typically, a disk sector) to each disk (this is also known as block striping). A further aspect of the RAID 4 architecture is that a single storage unit is designated as the parity unit.

A limitation of a RAID 4 system is that writing a data block on any of the independently operating storage units also requires writing a new parity block on the parity unit. The parity information stored on the parity unit must be read and XOR'd with the old data (to "remove" the information content of the old data), and the resulting sum must then be XOR'd with the new data (to provide new parity information). Both the data and the parity records then must be rewritten to the disk drives. This process is commonly referred to as a "Read-Modify-Write" (RMW) operation.

Thus, a read and a write operation on the single parity unit occurs each time a record is changed on any of the storage units covered by a parity record on the parity unit. The parity unit becomes a bottleneck to data writing operations since the number of changes to records which can be made per unit of time is a function of the access rate of the parity unit.
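The Read-Modify-Write parity update described above, and the XOR-based reconstruction it relies on, can be illustrated with a minimal Python sketch. The function names (`xor_blocks`, `full_parity`, `rmw_parity`) and the 4-byte sample blocks are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(blocks):
    """Parity over a whole stripe: the XOR of every data block."""
    parity = bytes(len(blocks[0]))  # all-zero block of the right length
    for block in blocks:
        parity = xor_blocks(parity, block)
    return parity


def rmw_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Read-Modify-Write: XOR out the old data, then XOR in the new data."""
    without_old = xor_blocks(old_parity, old_data)
    return xor_blocks(without_old, new_data)


# Hypothetical 4-byte blocks on three data units plus one parity unit.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = full_parity(data)

# Overwrite the block on unit 1; only that unit and the parity unit are touched.
new_block = b"\xff\x00\xff\x00"
parity = rmw_parity(parity, data[1], new_block)
data[1] = new_block

# If unit 2 then fails, its block is the XOR of the survivors and the parity.
rebuilt = xor_blocks(xor_blocks(data[0], data[1]), parity)
```

Every write in this sketch reads and rewrites the single parity block, which is the access pattern that makes a dedicated parity unit the limiting resource.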
This is in contrast to the faster access rate provided by parallel operation of the multiple storage units. Because of this limitation, a RAID 4 system is generally not considered to be suitable for computer systems designed for OLTP purposes. Indeed, it appears that a RAID 4 system has not been implemented for any commercial purpose.

A RAID 5 architecture uses the same parity error correction concept of the RAID 4 architecture and independent actuators, but improves on the writing performance of a RAID 4 system by distributing the data and parity information across all of the available disk drives. Typically, "N+1" storage units in a set (also known as a "redundancy group") are divided into a plurality of equally sized address areas referred to as blocks. Each storage unit generally contains the same number of blocks. Blocks from each storage unit in a redundancy group having the same unit address ranges are referred to as "stripes". Each stripe has N blocks of data, plus one parity block on one storage unit containing parity for the remainder of the stripe. Further stripes each have a parity block, the parity blocks being distributed on different storage units. Parity updating activity associated with every modification of data in a redundancy group is therefore distributed over the different storage units. No single unit is burdened with all of the parity update activity.

For example, in a RAID 5 system comprising 5 disk drives, the parity information for the first stripe of blocks may be written to the fifth drive; the parity information for the second stripe of blocks may be written to the fourth drive; the parity information for the third stripe of blocks may be written to the third drive; etc. The parity block for succeeding stripes typically "precesses" around the disk drives in a helical pattern (although other patterns may be used).

In addition to the five RAID architectures, a sixth architecture is sometimes referred to as "RAID 0", even though it lacks redundancy. RAID 0 is a collection of data storage devices in which data is spread (striped) over several data storage devices to achieve higher bandwidth, but with no generation or storage of redundancy information.

All of the conventional RAID configurations use a central data storage device controller to coordinate a transfer of data between a host computer and the array of data storage devices. The central data storage device controller (1) determines to which particular data storage device within an array to write data, (2) generates and writes redundancy information, and (3) reconstructs lost data from the redundancy information upon a failure of a data storage device. FIG. 1 is an example of such a system. A central data storage device controller 1 is coupled to a host computer 2 by a host bus 3. A plurality of data storage devices 4 are coupled to the data storage device controller 1 by a plurality of device buses 5. The data storage device controller 1 distributes data over the device buses 5 to each of the data storage devices 4. A system in which a redundant data storage device controller is added to eliminate the data storage device controller as a single point of failure is taught in a co-pending application owned by the assignee of the present invention (U.S. patent application Ser. No. 07/852,374).

However, in RAID systems which use a central data storage device controller to manage individual data storage devices, the full expense of a controller capable of controlling the maximum number of data storage devices is needed, even if only a few data storage devices (such as 3, the minimum number for a true RAID 3 or 5 system) are to be used for a particular computer system. This means that the central data storage device controller must: (1) be capable of supporting communications between the maximum number of data storage devices, (2) have sufficient internal buffer memory to allow the data storage device controller to receive and manage data destined for the maximum number of storage devices, (3) be capable of handling a sufficiently large number of interrupts to communicate with a host computer and each data storage device, and (4) be fast enough to handle management functions associated with the maximum number of data storage devices in a RAID configuration.

Also, the addition of more data storage devices than can be handled by a single data storage device controller requires another data storage device controller to be added. Therefore, the cost of expansion is a step function (i.e., the cost of adding additional data storage capacity increases in relatively small increments for data storage devices, up to the point at which an additional controller must be added, at which time the cost of expansion increases by a much larger increment to pay for the added controller).

In view of the foregoing, there is a need for a RAID-compatible data storage system having a control system in which each incremental increase in storage capacity can be made at a cost that is proportional to the increase in capacity. It would also be desirable if such a data storage system could be implemented so that no changes need be made to a host computer.

The present invention provides such a data storage system.

SUMMARY OF THE INVENTION

The present invention is a RAID-compatible data storage system which allows incremental increases in storage capacity at a cost that is proportional to the increase in capacity. The control and interface functions previously performed by a single (or redundant) central data storage device controller are distributed among a number of modular control units (MCUs) cooperatively operating in parallel. In the preferred embodiment, each MCU is physically coupled to a data storage device to form a basic, low-cost integrated storage node. Additional data storage devices may be added to this basic storage node. The system does not require changes to the host system.

In the preferred embodiment of the present invention, each MCU includes at least two bus interface ports, one or two serial interface ports, a processor optimized for interprocessor communications control and management, random access memory (RAM), and read-only memory (ROM). One of the bus ports interfaces an MCU with the host computer on a host bus, and the other bus port interfaces an MCU with one or more data storage devices coupled to the MCU by a data storage device (DSD) bus. The MCUs are preferably interlinked in a ring configuration through the serial interface ports. The MCUs use a "store and forward" protocol for passing control information around the ring. The serial interface ports provide a means by which each MCU may communicate with each other MCU to facilitate the implementation of a memory array architecture, such as a RAID architecture. Paired MCUs can be configured as a RAID 0 or 1 system, and three or more MCUs can be configured as a RAID 0, 3, 4, or 5 system. Increments in storage capacity can be made by adding data storage devices to the DSD bus of one or more MCUs ("vertical" expansion), or by adding additional MCUs with at least one attached data storage device ("horizontal" expansion).

Identification numbers or codes are "logically" assigned to each MCU coupled to the host bus, and MCUs
can monitor, or "snoop", the host bus. Therefore, the entire data storage array (including a plurality of MCUs) may appear as a single device capable of responding to a single identification number on the host bus, or may appear as a number of independent devices, each having discrete identification numbers on the host bus. The ability to have more than one MCU appear to the host computer as a single node on the host bus means that the only limit on the number of MCUs that can be present on the host bus is the physical and electrical limitations imposed by the ability of the bus drivers and receivers to reliably transmit the signals between the host computer and each MCU.

In addition to the MCUs, other devices, such as conventional storage arrays or stand-alone data storage devices, may be coupled directly to the host bus. Thus, additional peripheral devices may be directly accessed by all of the MCUs across the host bus, as well as by the host computer.

When the host computer requests that data be read from, or written to, one or more data storage devices through one or more MCUs, one of the MCUs connects over the host bus with the host computer to serve as a controlling MCU. The controlling MCU receives a copy of a command descriptor block (CDB) that specifies an input/output operation and the data blocks involved in the operation. The controlling MCU then notifies the other MCUs that are involved in the read or write operation. For a read operation, each MCU coupled directly by its DSD bus to one or more data storage devices on which at least part of the requested data is stored begins requesting data from the appropriate data storage device or devices. Control of the host bus is passed to the MCU which is coupled to the data storage device from which, or to which, the first data block is to be read or written. If that data storage device is available upon transfer of control of the host bus to that "lead" MCU, data is transferred between the host computer and the lead MCU without disconnecting the host computer from the host bus. However, if that data storage device is not available when the lead MCU takes control of the host bus, then the host computer is disconnected from the host bus. The lead MCU is responsible for reestablishing the connection to the host computer when that data storage device becomes available.

Control of the host bus is transferred from the lead MCU to the next MCU in sequence, which reads or writes the next data block, so that data is received by the host computer, or written to each data storage device, in proper order. When the last block is transferred between the host computer and an MCU across the host bus, the MCU that made the last transfer sends a "complete" message to the host computer and disconnects from the host bus. In the preferred embodiment of the present invention, the last MCU to communicate with the host computer in response to a particular request directed to a logical MCU identification number is responsible for servicing future requests from the host computer directed to that identification number.

The invention also encompasses the use of data caching to improve performance, and "warm spares" of data storage devices to provide for on-line automatic rebuilds of data stored on a failed data storage device.

In addition to the above advantages, the invention provides a host computer with a large amount of data storage while appearing to the host computer as one or more large standard storage devices. The invention allows a significant increase in system performance by providing concurrent input/output operations by a number of data storage devices without changes to the host computer. The invention also provides a relatively low-cost, approximately linear expansion capability.

The details of the preferred embodiment of the present invention are set forth in the accompanying drawings and the description below. Once the details of the invention are known, numerous additional innovations and changes will become obvious to one skilled in the art.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a prior art RAID configuration.

FIG. 2 is a block diagram of one configuration of the present invention.

FIG. 3 is a block diagram of a modular control unit in accordance with the preferred embodiment of the present invention.

FIG. 4 is a diagram of a logical system configuration in accordance with the preferred embodiment of the present invention.

FIG. 5A and FIG. 5B are diagrams showing a non-redundant read operation in accordance with the preferred embodiment of the present invention.

FIG. 6 is a diagram showing a non-redundant write operation in accordance with the preferred embodiment of the present invention.

FIG. 7 is a diagram showing a redundant read operation after a failure, in accordance with the preferred embodiment of the present invention.

FIG. 8A and FIG. 8B are diagrams showing a redundant write operation, in accordance with the preferred embodiment of the present invention.

Like reference numbers and designations in the various drawings refer to like elements.

DETAILED DESCRIPTION OF THE INVENTION

Throughout this description, the preferred embodiment and examples shown should be considered as exemplars, rather than as limitations on the present invention.

System Architecture

The present invention is a RAID-compatible method and apparatus for interfacing a host computer with a plurality of data storage devices such that the control and management of each data storage device is transparent to the host computer (i.e., requires no special changes to the host computer). Control of the plurality of data storage devices is distributed among a plurality of Modular Control Units (MCUs) so that the cost of expanding data storage is proportional to the incremental increase in capacity of the data storage.

FIG. 2 is a simplified block diagram of the preferred embodiment of the present invention. A distributed disk array architecture 200 is shown which includes a host computer 201 coupled by a host bus 207 to three MCUs 203 and two stand-alone data storage devices 205. The host bus 207 is preferably the well-known Small Computer System Interface (SCSI) bus. Use of a standard SCSI bus (or the newer SCSI II bus) means that the host computer 201 communicates with the MCUs 203 in a standard way, without special changes to the host computer 201 and without requiring a costly custom bus.

Each MCU 203 is also coupled to at least one data storage device 209 by a data storage device (DSD) bus 211, which is preferably a SCSI bus. FIG. 2 shows three DSD buses 211. Each is independent of the other and of the host bus 207. The combination of an MCU 203 and at least one data storage device 209 coupled to the MCU 203 is referred to as a "node" 213. In the illustrated embodiment shown in FIG.
2, one data storage device 209 is coupled by a DSD bus 211 to a first MCU 203a to define a first node 213a. Similarly, one data storage device 209 is coupled to a second MCU 203b, and three data storage devices 209 are coupled to a third MCU 203c, to define second and third nodes 213b, 213c respectively.

The preferred embodiment is described herein as using SCSI buses for the host bus 207 and the DSD bus 211. The SCSI and SCSI II buses are well-known, and SCSI compatible data storage devices 209 are inexpensive and widely available. An advantage of both types of SCSI bus is that they allow a data storage device 209 on the bus to logically "disconnect" from the bus while performing a seek operation and transferring data to/from a local track buffer in the data storage device 209. As described below, this disconnect feature is useful in implementing the present invention. However, use of SCSI buses in the illustrated embodiment is described only as an example of the preferred implementation of the present invention. It should be understood that the present invention is not limited to use only with SCSI buses.

Each MCU 203 preferably has two additional communications ports, such as serial ports 311 (see FIG. 3). At least one of the serial ports 311 is coupled to a serial port 311 of an adjacent MCU 203, as shown in FIG. 2, to form a serial communications link 212 between an array of nodes 213. Although MCUs 203 can communicate with each other over the host bus 207, normal control messages are passed between MCUs 203 only over the serial communications link 212.

In the preferred embodiment of the present invention, each MCU 203 has two serial ports 311, each of which is coupled to a serial port of an adjacent MCU 203 to form a bi-directional ring network, as shown in FIG. 2. This arrangement allows communications to continue between the array nodes 213 even if one MCU 203 fails. However, a single serial port 311 in each MCU 203 could be used in conjunction with a standard network bus configuration. In any case, messages can be passed between nodes 213 on the serial communications link 212 without participation by the host 201 or interference with communications on the host bus 207. The serial communications link 212 preferably operates as a store-and-forward network, in known fashion.

The MCUs 203 interface each of their coupled data storage devices 209 to the host computer 201 to allow the host computer 201 to write data to, and read data from, the array of data storage devices 209 in the linked nodes 213 such that control of the array is transparent to the host computer 201. From the perspective of the host computer 201, a large capacity data storage device appears to reside at each of a selected number of host bus addresses (such as SCSI identification numbers) and appears to be equal in capacity to a number of the data storage devices 209 taken together. The apparent data storage capacity at each host bus address on the host bus 207 need not be the same as the apparent storage capacity at each other host bus address.

In the preferred embodiment of the present invention, data is stored on the data storage devices 209 in units known as "blocks" (which may be, for example, a sector on a disk drive). Data blocks are organized into "logical disks". One or more logical disks may be located on a physical data storage device 209. Each logical disk comprises a portion of a physical data storage device 209, and is defined by a physical data storage device number, starting block number, and number of blocks. Logical disks are organized into "logical volumes". Each logical volume comprises one or more logical disks, all having the same number of blocks. A logical volume may include the logical disks stored on one or more physical data storage devices 209. Logical volumes are organized into "redundancy groups". Each redundancy group comprises one or more logical volumes, all having the same "striping depth". The striping depth is the number of data blocks that are consecutively written to a single logical disk before starting to write blocks to a next logical disk.

In the preferred embodiment of the present invention, the host computer 201 is responsible for setting the configuration of logical disks and volumes. This may be done, for example, in the manner described in U.S. patent application Ser. No. 07/612,220, entitled "Logical Partitioning of a Redundant Array Storage System", and assigned to the assignee of the present invention. In addition, a SCSI identification number is associated with a respective logical volume. (Alternatively, a Logical Unit Number, or LUN, is associated with a respective logical volume.) Data is striped across all of the logical disks in a logical volume using the associated striping depth. When the host computer 201 requests a read or write operation to one of the data storage devices 209, the host computer refers to the logical address (i.e., logical disk and volume). When the invention is implemented using a SCSI host bus 207, input/output (I/O) requests are made using a command descriptor block. As is known in the art, a SCSI command descriptor block is typically a 6, 10, or 12 byte block that contains an operation code (e.g., a "Read" or "Write" code), the logical volume number to which the operation is directed, the logical block address for the start of an operation, and the transfer length (in blocks) if the operation code involves a data transfer.

MCU Architecture

FIG. 3 is a simplified block diagram of an MCU 203 in accordance with the preferred embodiment of the present invention. Each MCU 203 preferably includes a processor 301 coupled to a read-only memory (ROM) 303, a random access memory (RAM) 305, a first bus interface 307, a second bus interface 309, and at least two serial interfaces 311. The first bus interface 307 is coupled to the processor 301 and to the host bus 207. The second bus interface 309 is coupled to the processor 301 and to a DSD bus 211. The first and second bus interfaces 307, 309 may be implemented, for example, using the NCR 53C90, 53C94, or 53C700 SCSI interface integrated circuits.

The processor 301 is preferably a "Transputer" from Inmos Corporation. "Transputer" processors are specifically designed for interprocessor communications over serial links at current rates of 10 to 20 Mbits per second. Transputer processors are also designed to handle interrupts quickly, and are thus well-suited for use in controllers. However, other processors, such as RISC processors, could also be used to implement the invention.

In the preferred embodiment, each of the MCUs 203 is physically mounted on one data storage device 209. Additional data storage devices 209 may be "daisy chained" to an MCU 203 by appropriate cable connection. The power and data/control connectors of the data storage device 209 upon which an MCU 203 is mounted are connected to the MCU 203. The MCU 203 has power and data/control connectors that mimic the power and data/control connectors of the data storage device 209. The data/control and power connectors of the MCU 203 are respectively coupled to the host bus 207 and to the data storage device power source (not shown) in place of the connectors of the data storage device 209. The MCU 203 preferably is physically configured to conform to the form factor (e.g., 3½") of the attached data storage device 209 so that the MCU 203 can be retrofit into pre-existing systems.
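The striping depth defined earlier in this section determines which logical disk holds a given block of a logical volume. The following is a minimal Python sketch of one such mapping, assuming plain round-robin striping with no parity blocks and logical disks of equal size. The function name `locate_block` and its argument names are hypothetical and used only for illustration; the patent does not prescribe this particular arithmetic.

```python
def locate_block(volume_block: int, num_disks: int, striping_depth: int):
    """Map a block number within a logical volume to
    (logical disk index, block number within that logical disk).

    Assumes round-robin striping: striping_depth consecutive blocks go to
    one logical disk before writing moves on to the next logical disk.
    """
    if volume_block < 0:
        raise ValueError("block number must be non-negative")
    if num_disks < 1 or striping_depth < 1:
        raise ValueError("num_disks and striping_depth must be at least 1")

    chunk = volume_block // striping_depth   # which run of consecutive blocks
    offset = volume_block % striping_depth   # position inside that run
    disk = chunk % num_disks                 # runs rotate across the disks
    row = chunk // num_disks                 # completed passes over all disks
    return disk, row * striping_depth + offset


# Hypothetical volume: 3 logical disks, striping depth of 2 blocks.
layout = [locate_block(b, num_disks=3, striping_depth=2) for b in range(8)]
```

With these sample parameters, volume blocks 0 and 1 fall on logical disk 0, blocks 2 and 3 on logical disk 1, blocks 4 and 5 on logical disk 2, and blocks 6 and 7 return to logical disk 0.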
In one embodiment of the present invention, the software that controls the operation of the processor 301 of each MCU 203 is stored in the ROM 303. Upon initial application of power to the MCU 203, the processor 301 is "hard" vectored to a start address in the ROM 303. In an alternative embodiment of the present invention, the ROM 303 only provides instructions which point to software code in a data storage device 209 coupled to the MCU 203 on the DSD bus 211 or over the host bus 207. In the alternative embodiment, the ROM 303 provides instructions to load the software code from the data storage device 209 into the RAM 305 local to the processor 301. In another embodiment of the present invention, the control software code may be stored in an electrically erasable read only memory (EEROM), electrically alterable read only memory (EAROM), or a similar non-volatile, re-programmable memory device. The host computer 201 downloads the software code to the MCU 203 and causes the software code to be written into such a memory device by issuing instructions to the MCU 203 across the host bus 207. By granting the host computer 201 the ability to alter the software code run by the MCU processor 301, updates to the software code can be made easily to operational MCUs 203 in the field.

In the preferred embodiment of the present invention, a number of host bus addresses (such as SCSI LUN identification numbers) are assigned to the array of MCUs 203. However, in the preferred embodiment, the number of MCUs 203 can exceed the number of host bus addresses. The host bus address assigned to an MCU 203 is indicated by a host bus address identification means 313 (such as a jumper, programmable read only memory, dip switch, detachable connection, or any other means for indicating an address). If the number of MCUs 203 exceeds the number of available host bus addresses, then one MCU 203 is associated with each unique host bus address. The host bus address identification means 313 for each remaining MCU 203 is configured to indicate "no address". In any case, each MCU 203 preferably may respond to any of the host bus addresses assigned to the array of nodes 213. That is, each MCU 203 "snoops" on the host bus 207 for addresses, commands, and data, and can request and receive control of the host bus 207. Because host bus addresses or identification numbers are "logically" assigned to each MCU coupled to the host bus 207, the entire data storage array may appear as a single device capable of responding to a single identification number on the host bus 207, or may appear as a number of independent devices, each having discrete identification numbers on the host bus 207. The ability to have more than one MCU 203 appear to the host computer 201 as a single node 213 on the host bus 207 means that the only limit on the number of MCUs 203 that can be present on the host bus 207 is the physical and electrical limitations imposed by the ability of the bus drivers and receivers to reliably transmit the signals between the host computer 201 and each MCU 203.

In the preferred embodiment, each MCU 203 maintains an identical configuration data structure that describes the network of MCUs 203, data storage devices 209 coupled to DSD buses 211, and data storage devices 205 coupled directly to the host computer 201 on the host bus 207. Each MCU 203 determines what devices are coupled to its DSD bus 211 and the characteristics of those devices (e.g., data block size, number of data blocks, etc.) by issuing a query on its DSD bus 211 for each DSD bus identification number. Each MCU 203 communicates its own configuration data structure to each other MCU 203. If an MCU 203 receives a configuration data structure that differs from the current configuration data structure that MCU 203 is maintaining, the receiving MCU 203 updates its configuration data structure to reflect the new information received and forwards the new configuration data structure to the other MCUs 203 to bring them up to date. In addition, when any change is sensed in the configuration of a node 213 (for example, an attached data storage device 209 fails), that node 213 updates the configuration data structure and sends a "change configuration" message to each other MCU 203. Through this process, each node 213 will eventually have an exact duplicate of the system configuration data structure.

FIG. 4 is a diagram of a logical system configuration in accordance with the preferred embodiment of the present invention. The logical system configuration is preferably a table that defines the array of data storage devices 209 in terms of redundancy groups 400, logical volumes 402, and logical disks 404. As noted above, each logical disk is defined by a physical data storage device number 406, starting block number 408, and number of blocks 410. One method of defining an array of disks in this manner is disclosed in co-pending, co-owned U.S. patent application Ser. No. 07/612,220, the teachings of which are hereby incorporated by reference. However, the invention encompasses any method of mapping the blocks of the array of physical data storage devices 209 so as to be able to translate a host I/O request into locations for the relevant blocks.

In the preferred embodiment of the present invention, the logical system configuration data structure is written to the data storage device 209 on which the MCU 203 is mounted. The logical system configuration data structure is preferably time-stamped so that when an MCU 203 initializes itself, that MCU 203 can determine whether the configuration data structure which that MCU 203 reads from the data storage device 209 is current. This determination can be made by each MCU 203 broadcasting its time-stamp to every other MCU 203. Each MCU 203 then compares its time-stamp to each received time-stamp. If the MCU 203 determines that the configuration data structure read during initialization is not current, the MCU 203 queries one of the other nodes 213 for a copy of the configuration data structure maintained by that node 213. Preferably, the queried node is one that has the most current time-stamp.

In contrast to the addressing scheme of the host bus 207, each data storage device 209 is assigned to a particular address on each DSD bus 211 and responds only to that bus address. Therefore, each data storage device 209 of the preferred embodiment of the present invention has a unique address among the devices connected to the same DSD bus 211.

Through the logical system configuration data structure, each MCU 203 can determine the address of each other MCU 203 located on the serial communications link 212. The determination as to which of the two serial ports 311 of an MCU 203 a message is transmitted through is based upon which direction yields the shortest possible route, as determined by referring to the logical system configuration data structure. In the preferred embodiment of the present invention, messages from one node 213 to another are received by an adjacent node 213. If the message is not addressed to that node 213, the message is sent to the next adjacent node 213 on the path to the destination node. Such "store and forward" communications links are well-known in the art.

In the preferred embodiment, each MCU 203 transmits an "operational" message at time intervals to the two MCUs 203 to which it is directly coupled by the serial ports 311. Therefore, whenever one of the MCUs 203 fails, at least two
other MCUs 203 in the ring are made aware of the failure by the fact that an expected "operational" message has not been received. The first MCU 203 to detect the failure of another MCU 203 to transmit the "operational" message generates a "configuration change" message and transmits the change to each of the other nodes 213. If the node 213 that was detected as having failed has, in fact, not failed, then when it receives the configuration change it transmits an "operational error" message which is received by the two adjacent nodes 213 (only one, in the case in which one leg of the serial communications link 212 itself has failed). The "operational error" message is forwarded to each node 213 on the serial communications link 212. Therefore, when the node 213 that sent the configuration change receives the message, that node 213 determines that the serial communications link 212 itself has a defect and marks that defect in the logical system configuration data structure. The revised logical system configuration data structure is then forwarded to all other nodes 213.

In the preferred embodiment of the present invention, each MCU 203 has sufficient RAM 305 to cache data being handled by that MCU 203 for most I/O transactions. The cache allows a "write complete" message to be sent to the host computer 201 immediately upon receipt of data to be written by each MCU 203 (i.e., data need not actually be written to an attached data storage device 209 before the "write complete" message is sent from an MCU 203 to the host computer 201).

The size of the cache determines how busy the node 213 can be and still handle a transfer without logically disconnecting to allow the MCU 203 to complete a pending I/O operation. Some of the factors that determine the size of the cache are the size of a data block to be written to the data storage device 209 and the speed of the data storage device 209 with respect to the speed of the host bus 207 and the DSD bus 211. If it becomes necessary to disconnect from the host computer 201, the host computer 201 must wait until the busy MCU 203 has sufficient memory available in the cache to accept a new data block from the host computer 201. At that time, the disconnected MCU 203 reestablishes the connection with the host computer 201 using the host bus address of the original I/O request made by the host computer 201.

A second way of implementing a cache is to dedicate one MCU 203 to caching, and use RAM or some other fast storage device in place of a data storage device 209. This "cache" MCU would provide caching for the entire array of nodes 213.

Overview of Operation

Using the present invention, paired MCUs 203 can be configured as a RAID 0 or 1 system, and three or more MCUs 203 can be configured as a RAID 0, 3, 4, or 5 system. In such RAID-type configurations, it is the responsibility of each MCU 203 to coordinate I/O operations (including any read-modify-write operations and data rebuild operations) in such a manner as to make the combination of several data storage devices 209 appear to the host computer 201 as a single, large capacity, high bandwidth, reliable data storage device. To accomplish this goal, one of the MCUs 203 is initially responsible for responding to an I/O request from the host computer 201 to either read a block of data from, or write a block of data to, the array of data storage devices 209. That "controlling" MCU 203 determines from which, or to which, data storage device 209 the first data block of the requested data is to be read or written. That is, after receiving a data transfer request from the host computer 201, the MCU 203 that initially responded determines from the logical system configuration data structure which data storage devices 209 will be involved in the requested operation (i.e., the locations of the logical blocks that together comprise the data specified in the host computer's I/O request), and with which MCUs 203 those data storage devices 209 are associated. An MCU 203 is "involved" in a data operation if any portion of the data referenced in an I/O request (or associated redundancy information) is to be written to, or read from, a data storage device 209 coupled to that MCU 203.

Responsibility for communicating with the host computer 201 in response to the request is passed to the "lead" MCU 203 that is associated with the data storage device 209 to which, or from which, the first data block is to be transferred. That "lead" MCU 203 then begins the transfer of the data to or from the host computer 201.

Transfer of control and coordination among the MCUs 203 is accomplished by messages transmitted over the serial communications link 212. Control is transferred from the lead MCU 203 to the next MCU 203 in sequence so that each data block is received by the host computer 201, or written to each data storage device 209, in proper order. When the last data block is transferred between the host computer 201 and an MCU 203 across the host bus 207, the MCU 203 that made the last transfer sends a "complete" message to the host computer 201 and disconnects from the host bus 207.

Coordination between MCUs 203 ensures that only one MCU 203 responds at any one time to an I/O request by the host computer 201 on the host bus 207. In one embodiment of the present invention, the last MCU 203 to have responded to a particular I/O request from the host computer 201 directed to a particular host bus address is responsible for responding to the next I/O request made to that host bus address.

In another embodiment of the present invention, one particular MCU 203 is assigned primary responsibility for responding to an I/O request to a particular host bus address. That MCU 203 may pass responsibility for servicing requests on that host bus address to another MCU 203 by sending a message to the other MCUs 203 with a command to service the request.

The controlling MCU 203 is responsible for coordinating the communication between the host computer 201 and the other MCUs 203 that are involved in an I/O operation. However, the controlling MCU 203 need not be coupled to one of the data storage devices 209 to which data is to be written. For example, if the data storage devices 209 coupled to an MCU 203 fail, but the MCU is otherwise operational, that MCU 203 can be selected as the controlling MCU, thus off-loading some processing tasks from other MCUs 203. As another example, the controlling MCU 203 may be busy, causing I/O requests from the host computer 201 to be delayed. If the nominal controlling MCU 203 is sufficiently occupied, it selects another MCU 203 to control any new I/O request from the host computer 201. This transfer of responsibility can continue through other MCUs 203. The determination as to which MCU 203 to select can be pre-defined, or can be made by polling the other MCUs 203.

Non-Redundant Read or Write Operations

A non-redundant operation is one in which data is written to one data storage device 209 at one location, and cannot be recovered upon a failure of that data storage device 209. Non-redundant read and write operations are the same except for the direction of data flow.

To perform a non-redundant I/O operation, the host computer 201 places an I/O command on the host bus 207 requesting data to be read from or written to a particular set
of addresses at a particular host bus address. One of the MCUs 203 responds to the request, thereby accepting the task of controlling the read operation.

In one embodiment of the present invention, the MCU 203 that responds is the last MCU 203 to have had responsibility for responding to the host computer 201 on the same host bus address as the present I/O request. If no previous attempt has been made by the host computer 201 to communicate at the host bus address of the present request, the MCU 203 that is set by its address identification means 313 to the present host bus address responds to the request.

In another embodiment of the present invention, the determination as to which MCU 203 is to respond is made by assigning a single MCU 203 to the task of responding to each host command addressed to a specific host bus address. Therefore, when an I/O request is made by the host computer 201 with respect to a particular host bus address, a particular MCU 203 is responsible for responding to the request. That MCU 203 is responsible for determining which MCU 203 is to be the controlling MCU 203 based upon the logical address of the request and how busy each MCU 203 is at the time.

Once a controlling MCU 203 is selected, the controlling MCU 203 accepts a command descriptor block from the host computer 201. From the information in the command descriptor block, the controlling MCU 203 determines which other MCUs 203 are involved in the I/O operation, and sends a "disk request" message to each. Each MCU 203 that receives a disk request message queues up that command, and executes it as soon as possible (since prior I/O requests may still be pending).

Responsibility for communicating in response to an I/O request by the host computer 201 on the host bus address is passed from one involved MCU 203 to another, based upon the order in which the host computer 201 expects to see the data returned or received. Thus, each involved MCU 203 which is coupled to a data storage device 209 from which data is to be read can transfer data to the host computer 201 upon receiving the data from the relevant data storage device 209 and upon being given responsibility for responding to the request. Each involved MCU 203 which is coupled to a data storage device 209 to which data is to be written can accept data from the host computer 201 upon being given responsibility for responding to the request.

Each node 213 coordinates with the other nodes 213 via serial messages so that as the next data block that the host computer 201 is expecting or is sending becomes available, the node that will transmit or receive that data block connects with the host bus 207, identifies itself with the original host bus address from the I/O request being processed, and transfers the data block to/from the host computer 201. That node 213 then sends a completion message to the next node 213 in sequence, which takes over for the next data block.

The time required to read a particular data block from any particular data storage device 209 may be so long as to make it necessary to logically disconnect the associated MCU 203 from the host computer 201. In the case of a disconnection between the host computer 201 and an MCU 203, the MCU 203 that is responsible for communicating the next pending data block must reestablish the connection with the host computer 201 on the same host bus address that the host computer 201 used to request the data.

For a read operation, each MCU 203 can begin reading immediately (assuming no other operations are pending), and thus disconnections can be minimized. Similarly, for a write operation, use of a cache or buffer allows a "write complete" message to be sent to the host computer 201 immediately upon receipt of data to be written by each MCU 203, again minimizing disconnections.

When all the data that was requested has been transmitted to/from the host computer 201, the MCU 203 that transmitted or received the last data block passes control of the host bus back to the controlling MCU 203. The controlling MCU 203 transmits an "operation complete" message to the host computer, and logically disconnects from the host computer 201.

If an error occurs, the nodes 213 cancel the operation (if that option is selected by a user during a setup procedure). If a node 213 fails to respond, the other nodes 213 will cancel the operation, or complete their data transfers (as selected by the user during a setup procedure).

FIGS. 5A and 5B are diagrams showing a non-redundant read operation in accordance with the preferred embodiment of the present invention. FIG. 5A shows the steps for processing a command descriptor block (CDB) for read operations. When an MCU 203 is addressed by the host computer 201, the MCU 203 waits for a CDB from the host computer 201 (STEP 500). The MCU 203 then uses the logical system configuration data structure to convert the blocks defined by the CDB to "request blocks" that map the data to be read onto the logical volumes and logical disks of the array for involved MCUs 203 (STEP 502). Request blocks are messages that define for each involved MCU 203 the logical blocks that must be read by each MCU 203 from its associated data storage devices 209, and the order in which those blocks must be merged to meet the I/O request from the host computer 201. The addressed MCU 203 then sends the request blocks to all involved MCUs 203 (STEP 504).

FIG. 5B shows the processing at each node 213 of request blocks for a non-redundant read operation. Each MCU 203 waits for a request block (STEP 510). Upon receiving a request block, each involved MCU 203 allocates appropriate buffers and sets up the disk I/O request required to read the data identified in the received request blocks (STEP 512). The requested data is then read and stored in the allocated buffers (STEP 514). Each involved node 213 then tests to see whether it is the next node in the data transfer sequence, which is performed by means of a direct memory access (DMA) operation, in known fashion (STEP 516). If a node 213 is not next in the data transfer sequence, it waits for notification from another node 213 before starting its data transfer operation (STEP 518).

If a node 213 is the next in order, it adopts the identification number of the device that was initially addressed by the host computer 201 (STEP 520). The node 213 then transfers the data from its buffers to the host computer 201 over the host bus 207 (STEP 522). After the transfer is completed, the node 213 releases the identification number and notifies the next MCU 203 in sequence that the current node 213 has completed its transfer (STEP 524).

The node 213 then tests to see whether the just-completed transfer was the last data transfer required for this node 213 (STEP 526). If no, the node 213 waits for notification to start another data transfer in sequence (STEP 518). If yes, the node 213 tests to see whether it has performed the last data transfer in its required sequence, as defined in the request blocks (STEP 528). If yes, the node 213 sends a completion status message to the host computer 201 (STEP 530), releases its buffers (STEP 532), and returns to the start of the process. If no, the node 213 releases its buffers (STEP 532), and returns to the start of the process.

FIG. 6 is a diagram showing a non-redundant write operation in accordance with the preferred embodiment of
the present invention. Processing for a non-redundant write request block is very similar to the processing for a non-redundant read request block. The steps for processing a command descriptor block (CDB) for write operations are essentially the same as the process shown in FIG. 5A for read operations. Steps in FIG. 6 that correspond to similar steps in FIG. 5B are marked with similar reference numbers. The principal difference in the write operation is that instead of reading the requested data block and storing the data in allocated buffers (step 514 in FIG. 5B), the process receives data from the host computer 201 to its buffers via the host bus 207 (step 522'), and writes the data from its buffers to the appropriate data storage device 209 after the last transfer from the host computer 201 to the node 213 (step 600).
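
The write flow described above can be sketched as a small simulation. This is a minimal illustration only, assuming that request blocks are represented as `(node, logical_block)` pairs listed in the order the host sends the data; the function name `non_redundant_write`, the in-memory `disks` dictionary, and the trace format are hypothetical and do not come from the specification.

```python
from collections import Counter, defaultdict


def non_redundant_write(request_blocks, host_payloads):
    """Simulate nodes taking turns on the host bus for a non-redundant write.

    request_blocks: list of (node, logical_block) in host transfer order
                    (an assumed representation of the request blocks).
    host_payloads:  list of bytes objects sent by the host, in the same order.
    """
    remaining = Counter(node for node, _ in request_blocks)
    buffers = defaultdict(list)   # per-node buffers
    disks = defaultdict(dict)     # stand-in for each node's data storage device
    trace = []

    for (node, block), payload in zip(request_blocks, host_payloads):
        # The node whose turn it is receives the block into its buffers
        # (corresponds to step 522' in the text).
        buffers[node].append((block, payload))
        trace.append(("receive", node, block))
        remaining[node] -= 1

        # After the node's last transfer from the host, it writes its
        # buffered data to its storage device (corresponds to step 600).
        if remaining[node] == 0:
            for buffered_block, buffered_payload in buffers.pop(node):
                disks[node][buffered_block] = buffered_payload
            trace.append(("flush", node))

    return dict(disks), trace
```

The sketch keeps the two properties the text emphasizes: blocks are accepted in the order the host sends them, and each node defers its disk write until its own last transfer has arrived.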
Redundant Read Operations 

If no failure of a node 213 occurs, redundant read operations are carried out in the same manner as non-redundant read operations. However, when a failure occurs, each node 213 that includes a data storage device 209 that is part of the redundancy group for requested data (i.e., those data storage devices 209 containing data that has been XOR'd together, including the data storage device 209 on which the parity data is stored) reads the relevant data blocks from the stripe involved in a read request from the host computer 201. These data blocks are then transferred over the host bus 207 to the controlling MCU 203 (or to another designated MCU 203) in order to compute the XOR sum necessary to rebuild that portion of the requested data stored on the failed node 213. The rebuilt data block is then transferred to the host computer 201 in proper sequence.
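
The XOR sum mentioned above relies on a standard property of parity: if the parity block is the XOR of all data blocks in a stripe, then any one missing block equals the XOR of the surviving blocks and the parity block. The following is a minimal sketch of that arithmetic; the helper name `xor_blocks` and the sample byte values are illustrative assumptions, not part of the specification.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe of three data blocks plus one parity block.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
parity = xor_blocks([d0, d1, d2])

# If the node holding d1 fails, d1 is rebuilt from the survivors and parity.
rebuilt_d1 = xor_blocks([d0, d2, parity])
```

In the terms used in the text, the surviving nodes would supply `d0`, `d2`, and `parity` to the designated MCU, which computes `rebuilt_d1` and passes it to the host in its proper place in the sequence.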

If only a data storage device 209 of a node 213 has failed, but the associated MCU 203 is operational, that MCU 203 can be designated to perform the rebuild task. If the node 213 that includes the controlling MCU 203 has failed, then control of the operation is passed to another MCU 203. The choice of MCU 203 to which control is passed may be pre-set in the logical system configuration data structure (i.e., a pre-planned order of succession), or may be the first MCU 203 that detects failure of a node 213. The new controlling MCU 203 completes the read operation and communicates the successful completion to the host computer 201 by transmitting an "operation complete" message.
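
The two selection policies named above (a pre-planned order of succession, or the first MCU to detect the failure) can be sketched as follows. This is an illustrative sketch only: the function name `choose_controlling_mcu`, the integer MCU identifiers, and the representation of the succession order as a list are assumptions made for the example.

```python
def choose_controlling_mcu(failed, succession=None, first_detector=None):
    """Pick a replacement controlling MCU after the current one has failed.

    failed:         set of MCU identifiers known to have failed.
    succession:     optional pre-planned order of succession (assumed to be
                    read from the logical system configuration data structure).
    first_detector: optional identifier of the MCU that first detected the failure.
    """
    if succession is not None:
        # Policy 1: walk the pre-planned order and take the first live MCU.
        for mcu_id in succession:
            if mcu_id not in failed:
                return mcu_id
        raise RuntimeError("no operational MCU in the succession list")

    # Policy 2: the first MCU to detect the failure takes control.
    if first_detector is None or first_detector in failed:
        raise RuntimeError("no operational detector available")
    return first_detector
```

For example, with MCU 0 failed and a succession order of `[0, 2, 1]`, the sketch selects MCU 2.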

In the preferred embodiment of the present invention, if a "warm spare" is available, the data that was on the failed data storage device 209 is rebuilt and written to the warm spare. A warm spare is an extra node 213 which is configured such that it may replace a failed node 213. At least one warm spare is provided which replaces an entire node 213 when either a data storage device 209 or an MCU 203 fails. The warm spare node 213 is generally inactive until the failure occurs.

In the case of a RAID 1 (mirrored drives) implementation, no rebuild is required, since duplicate data is kept on paired data storage devices 209. If one of the pair fails, the requested data is simply read from the other data storage device 209 of the pair.

FIG. 7 is a diagram showing a redundant read operation after a failure, in accordance with the preferred embodiment of the present invention. A redundant read operation is essentially the same as a non-redundant read operation, except in the case of a failure. FIG. 7 shows the process steps for handling a failure. When a failure is sensed, a designated MCU 203 builds a set of request blocks for the affected redundancy group data storage devices 209, identifying the logical blocks from which a failed logical block can be reconstructed (STEP 700). That MCU 203 then sends the request blocks to the affected MCUs 203, along with the identification of a target MCU 203 designated for performing the rebuild operation (STEP 702). The target MCU 203 typically would be the MCU that stores the parity data for the affected stripes of the redundancy group.

The MCUs 203 process the request blocks to read the affected data, and transfer that data to the target MCU 203 (STEP 704). The target MCU 203 begins to rebuild the lost data from the old data and from the old parity information (STEP 706). The target MCU 203 then tests whether it has received all necessary data from the other MCUs 203 (STEP 708). If no, further data is transferred from the affected MCUs to the target MCU (STEP 704). If yes, the target MCU then tests to see whether the data from the rebuilt MCU is next in sequence to be transferred to the host computer 201 (STEP 710). If the response to STEP 710 is no, the target MCU 203 waits for notification from the other MCUs 203 to start transferring the rebuilt data to the host computer 201 in the proper sequence (STEP 712).

If the rebuilt MCU data is next in sequence to be transferred to the host computer 201, then the target MCU 203 transfers the rebuilt data to the host computer 201 (STEP 714). The target MCU 203 then notifies the next MCU 203 in the data transfer sequence (STEP 716). Thereafter, the rest of the read operation can be completed in the same manner as for a non-redundant read operation (STEP 718).
Redundant Writes 

In a RAID 1, 3, 4, or 5 implementation of the present invention, a write operation proceeds when a host bus command is transmitted from the host computer 201 via the host bus 207 to a controlling MCU 203. Once the host bus command is received by an MCU 203 that accepts the task of controlling the operation, that MCU 203 determines which of the other MCUs 203 are involved in the write operation from the logical system configuration data structure. The involved MCUs 203 are those MCUs which are coupled to data storage devices 209 to which data is to be written (referred to as "write" MCUs), or which are in the same redundancy group as those data storage devices 209 to which data is to be written (referred to as "redundancy" MCUs). The controlling MCU 203 communicates with each of the write MCUs 203 by sending a "read old data" message on the serial communications link 212.
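
The distinction between "write" MCUs and "redundancy" MCUs can be illustrated with a short sketch. It assumes a simplified parity-style layout in which each stripe records which MCU holds each data block and which MCU holds that stripe's parity; the `stripes` structure and the function name `classify_involved` are hypothetical stand-ins for a lookup in the logical system configuration data structure.

```python
def classify_involved(stripes, blocks_to_write):
    """Split the MCUs involved in a redundant write into two roles.

    stripes:         list of dicts, each with
                       "data":   {logical_block: mcu_id}
                       "parity": mcu_id holding the stripe's redundancy block
                     (an assumed, simplified layout description).
    blocks_to_write: iterable of logical block numbers named in the request.
    Returns (write_mcus, redundancy_mcus) as sets of MCU identifiers.
    """
    wanted = set(blocks_to_write)
    write_mcus, redundancy_mcus = set(), set()
    for stripe in stripes:
        touched = wanted & stripe["data"].keys()
        if touched:
            # MCUs whose devices receive new data are "write" MCUs.
            write_mcus.update(stripe["data"][block] for block in touched)
            # The MCU holding parity for a touched stripe is a "redundancy" MCU.
            redundancy_mcus.add(stripe["parity"])
    return write_mcus, redundancy_mcus
```

With rotating parity, as in a RAID 5 style layout, the same MCU can appear in both sets when a request touches more than one stripe.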

To avoid writing concurrently to two volumes that share the same parity block (i.e., two logical volumes within the same stripe), a lock table is maintained which prevents concurrent writes to blocks within the same stripe. Preferably, the controlling MCU 203 maintains the lock table. The controlling MCU 203 locks a range of blocks by sending a "lock-request" message over the serial communications link 212, specifying the blocks to be locked, to each MCU 203 in the same stripe as a block to be modified. The controlling MCU 203 then waits for each such MCU 203 to send back a "lock granted" message. After completion of a modification within the locked stripe, the controlling MCU 203 sends each locked MCU 203 an "unlock" message, specifying the blocks to be unlocked.
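
The lock-request, lock-granted, and unlock exchange can be sketched as below. This is a simplified model under stated assumptions: messages are replaced by direct method calls, each MCU in the stripe is assumed to track the block ranges it has granted, and releasing already granted locks when one MCU refuses is an added assumption that the text does not describe. The names `LockTable`, `lock_stripe`, and `unlock_stripe` are hypothetical.

```python
class LockTable:
    """Tracks locked block ranges as (start, end_exclusive) pairs."""

    def __init__(self):
        self._ranges = []

    def lock(self, start, count):
        """Grant the lock unless the range overlaps one already locked."""
        end = start + count
        if any(s < end and start < e for s, e in self._ranges):
            return False          # would stand for a refused lock request
        self._ranges.append((start, end))
        return True               # would stand for a "lock granted" message

    def unlock(self, start, count):
        """Release a previously granted range (an "unlock" message)."""
        self._ranges.remove((start, start + count))


def lock_stripe(tables, start, count):
    """Request the same block range from every MCU in the stripe."""
    granted = []
    for mcu_id, table in tables.items():
        if table.lock(start, count):
            granted.append(mcu_id)
        else:
            # Assumption: back out partial grants so no range stays half-locked.
            for earlier in granted:
                tables[earlier].unlock(start, count)
            return False
    return True


def unlock_stripe(tables, start, count):
    """Send the unlock for the range to every MCU in the stripe."""
    for table in tables.values():
        table.unlock(start, count)
```

A controlling MCU in this model would call `lock_stripe` before the read-modify-write, proceed only if it returns `True`, and call `unlock_stripe` once the modification within the stripe is complete.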

The "read old data" operation for the write MCUs 203 is necessary in order to complete a "Read-Modify-Write" (RMW) operation. Therefore, each of the data storage devices 209 to which data is to be written is instructed in sequence by its associated MCU 203 to begin reading the old data from those logical blocks to which new data is to be written. Each write MCU 203 then takes control of the host bus 207 and transfers its old data block to the MCU 203 that contains the corresponding redundancy block. Thereafter, each write MCU 203 connects to the host computer 201 and
accepts a new data block to be written over the old data block. Each write MCU 203 then causes the new block to be written to an attached data storage device 209.

In a RAID 3 or RAID 4 implementation, one MCU 203 is the redundancy MCU. In a RAID 5 implementation, each of the MCUs 203 can be a redundancy MCU, since the redundancy data is striped across the data storage devices 209 in the array.

Accordingly, the current redundancy MCU 203 reads the old parity block associated with the current stripe to be written with new data. In addition, the current redundancy MCU 203 accepts the old data block from the current write MCU and XOR's it with the old parity block. Thereafter, the current redundancy MCU 203 passively reads (or "snoops") the new data from the host bus 207 as such data is being transferred from the host computer 201 to the current write MCU 203. In this way, the new data can be XOR'd with the old parity block and old data block as the new data block is transferred to the data storage device 209 to which the new data block is to be written. An extra transmittal of the new data block to the redundancy MCU 203 is thus not required.

Responsibility for communicating with the host computer 201 in response to the host computer's 201 request for a write to a host bus address is passed from the MCU 203 coupled to the data storage device 209 on which the first data block is to be written, to the MCU 203 coupled to the data storage device 209 on which the second data block is to be written. Responsibility for satisfying the host computer's 201 write request continues to be passed from one MCU 203 to another until each data block has been transferred to the data storage devices 209 in an order that can be read back in the sequence that the host computer 201 expects.

In the preferred embodiment of the present invention, each involved MCU 203 communicates a "write complete" message to the controlling MCU 203 when the involved MCU 203 has successfully received and written a data block to its data storage devices 209. Thus, if no cache is provided in each MCU 203, the current involved MCU 203 must wait until the data storage device 209 to which data is to be written responds with an indication that the data was successfully written before that MCU 203 can transmit a "write complete" message to the controlling MCU 203. Use of a cache or buffer allows a "write complete" message to be sent to the host computer 201 immediately upon receipt of data to be written by each MCU 203.

of the array, and further maps corresponding parity sectors in the array for storing the computed parity corresponding to the written data (STEP 802). The addressed MCU 203 then sends a lock message to each affected MCU 203, in order to lock the redundancy rows from being written by any other process during the time that the current process is writing to such rows (STEP 804). The addressed MCU 203 then sends the request block to all involved MCUs (STEP 806).

The addressed MCU 203 then waits for the affected MCUs 203 to complete the write operation (described in FIG. 8B) (STEP 808). After all data has been written, the addressed MCU 203 sends an unlock message to all affected MCUs 203, to unlock the previously locked redundancy rows (STEP 810). The addressed MCU 203 then returns a completion status flag to the host computer 201 to indicate completion of the write operation (STEP 812).

FIG. 8B shows the steps for processing redundant write request blocks in accordance with the present invention. The affected MCUs 203 wait for a request block (STEP 820). Upon receiving a request block, each affected MCU 203 tests to see whether the block being written to it is a data block (STEP 822). If yes, the MCU 203 is a write MCU, and it initiates the process of reading the corresponding old data from one of its associated data storage devices 209 (STEP 824). Meanwhile, the write MCU transfers the new data to itself from the host computer 201 over the host bus 207 (STEP 826). When reading of the old data has been completed, the old data is transferred by the write MCU over the host bus 207 to the redundancy MCU (STEP 828). The write MCU then writes the new data to the appropriate one of its associated data storage devices 209 (STEP 830). The write MCU then notifies the controlling MCU of the completion of that phase of the write operation (STEP 832).

On the other hand, if the data to be written to an MCU 203 is a redundancy block rather than a data block (STEP 822), then the MCU is a redundancy MCU. The redundancy MCU reads the old parity from one of its associated data storage devices 209 (STEP 834). The corresponding old data is transferred from the relevant MCU to the redundancy MCU over the host bus 207 (STEP 836). The redundancy MCU then "snoops" the corresponding new data off of the host bus 207 as the new data is being transferred by the host computer 201 to the write MCU (STEP 838). New parity is computed from the old parity, old data, and new data in known fashion, and written to the appropriate data storage device 209

If a failure occurs (i.e.. eidier a data storage device 209 or controlled by die redundancy MCU (STEP 840), The rcdun- 
an MCU 203 fails) during die read portion of a icad-modify- dancy MCU ttien notifies die controlling M CU o f die 
write operation, the data on die failed node 213 is recon- conviction of diat i^iase of the write operation (STEP 842). 
stcucted by XOR*ing the data and redundancy information Automatic Rebuild 

stored in the other data storage devices 209 of the redun- so In the preferred embodiment of the present invention, a 
dancy group. If a failure occurs during the write portion of warm spare replaces an entire node 213 when cidicr die data 
a read-modify-write operation, the operation con^letes if storage device 209 or the MC!U 203 fails. The warm spare 
there is only one failure. If multiple failures occur, the is generally inactive until the failure occurs. When a warm 
operation is aborted and an error message is sent to the host spare is activated upon the occurrence of a failure, an 
by die controlling MCU 203. When either the MCU 203 or ss automatic rebuild is initiated which causes die information 
die data storage device 209 of die controlling node 213 fails ttiat was stored in die failed node 213 to be reconstructed in 
during a write operation, another MCU 203 takes over the warm spare. The inf onnadon stored in each of the otha 
control of die write operation. nodes 213 is used to rebuild die information diat was stored 

PEGS. SAand 8B are diagrams showing a redundant write in die failed node 215. The warm spare receives a message 
operation, in accordance with the prefecred embodiment of fio on die serial communications link 212 from one of die other 
die present invention. FIG. 8A shows die steps for process- nodes 213 indicating that a node 213 has failed. (Detection 
ing a command descriptor block (CDB) for a redundant of such failure is described above). Hie warm spate main- 
write operation. When an MCU 203 is addressed by die host tains a cunent copy of tbc logical system ooofiguration data 
computer 201, die MCU 203 waits for a CDB from the host structure and only requires infonnation regarding die 
conqniter 201 (STEP 800). When die CDB is received, die 65 address of the failed node 213 in order to determine which 
addressed JACV 203 builds a set of request blocks diat map other nodes 213 must be contacted to reconstxuct the data 
die data to be written to die logical volumes and logical disks that was stored in the failed node 213. Id die prefened 
embodiment of the present invention in which a RAID 3, 4,
or 5 array is implemented, a bit-by-bit XOR'ing of the data
blocks in each node 213 of the redundancy group in which
the failed node is included (with the exception of the failed
node 213 and the warm spare) is used to reconstruct the data
that was stored in the failed node 213. That is, each data
block and the associated redundancy block for each stripe is
read and then transmitted over the host bus 207 to the warm
spare. The MCU 203 in the warm spare XOR's the received
blocks for each stripe, and then writes the sum to its attached
data storage devices 209 in the corresponding stripe.
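
The parity arithmetic behind the redundant write and the rebuild can be illustrated with a minimal sketch. It assumes byte-wise XOR parity over equally sized blocks, as is conventional for RAID 3, 4, and 5; the function names `xor_blocks`, `update_parity`, and `rebuild_block` are hypothetical and do not appear in the specification, and no bus transfers or MCU messaging are modeled.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Return the bit-by-bit XOR of one or more equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) yields, for each byte position, the bytes at that
    # position in every block; reduce folds them together with XOR.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Read-modify-write update (hypothetical helper):
    new parity = old parity XOR old data XOR new data."""
    return xor_blocks(old_parity, old_data, new_data)


def rebuild_block(surviving_blocks: list[bytes]) -> bytes:
    """Rebuild the block of a failed node (hypothetical helper) by XOR'ing
    the surviving data blocks and the redundancy block of the same stripe."""
    return xor_blocks(*surviving_blocks)


# One stripe with three data blocks and one parity block (illustrative values).
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
parity = xor_blocks(d0, d1, d2)

# The node holding d1 fails: the warm spare XORs the blocks it receives.
recovered = rebuild_block([d0, d2, parity])

# A later write replaces d1: parity is updated without reading d0 or d2.
new_d1 = b"\xaa\xbb"
new_parity = update_parity(parity, d1, new_d1)
```

In this sketch `recovered` equals the lost block `d1`, and `new_parity` equals the parity recomputed from scratch over `d0`, `new_d1`, and `d2`, which is why the redundancy MCU needs only the old data, the old parity, and the snooped new data.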

In the case of a RAID 1 (mirrored drives) implementation,
no rebuild is required, since duplicate data is kept on paired
data storage devices 209. If one of the pair fails, the data on
the other data storage device 209 of the pair is simply copied
to the warm spare.

When the failed node 213 is repaired or replaced and
returned to service, the data stored in the warm spare can be
written to the formerly failed node 213 and the warm spare
returned to an inactive state in anticipation of a future
failure. Alternatively, the repaired or replaced node 213 can
be designated as the warm spare.

In any case, the rebuild operation is preferably conducted
"on-line", with normal I/O operation of the host computer
201 continuing, but possibly with some degradation in
performance. The on-line reconstruction process may be, for
example, similar to the process described in co-pending U.S.
patent application Ser. No. 07/632,182, entitled "On-line
Restoration of Redundancy Information in a Redundant
Array System" and assigned to the assignee of the present
invention.

Examples of Use

The invention provides great versatility. Increments in
storage capacity can be made by adding data storage devices
209 to the DSD bus 211 of one or more MCUs 203
("vertical" expansion), or by adding additional MCUs 203
with at least one attached data storage device 209
("horizontal" expansion). Horizontal expansion also
increases the transaction bandwidth of the array, since more
nodes 213 exist that are addressable by a host computer 201.

For example, referring to FIG. 2, each of the three nodes
213a, 213b, 213c directly controls at least one data storage
device 209. By adding more nodes 213, the capacity of the
system is increased approximately linearly. In addition, as
more nodes 213 are added, the performance of the system
increases, since each node 213 can handle a concurrent I/O
request. Further, by sending messages to a target MCU 203,
the other MCUs 203 can access the data storage devices 209
attached to the target node 213 by working through its MCU
203.

As another example, again referring to FIG. 2, node 213c
directly controls three data storage devices 209. Further, by
sending messages to the associated MCU 203c, the other
MCUs 203 can access the data storage devices 209 attached
to node 213c by working through its MCU 203c.

FIG. 2 also shows that other data storage devices 205 may
be directly coupled to the host bus 207. Through the logical
system configuration data structure, each MCU 203 can be
made aware of these data storage devices 205 and can access
them over the host bus 207. An advantage of having other
data storage devices 205 on the host bus 207 is that they are
directly addressable by each of the MCUs 203 over the host
bus 207. Hence, data can be transferred between any MCU
203 and a host bus addressable data storage device 205 at the
speed of the host bus 207.

Because of the economical design of the MCUs 203 and
the use of "store and forward" serial messaging between
nodes 213, the present invention is particularly useful for
relatively small arrays, ranging from 3 to about 8 data
storage devices 209. The distributed, co-operative parallel
processing of the invention provides high performance at
low cost.

A number of embodiments of the present invention have
been described. Nevertheless, it will be understood that
various modifications may be made without departing from
the spirit and scope of the invention. For example, any
communications protocol or bus structure can be used to
interface the MCUs to the host computer, to the data storage
devices, and to each other. Furthermore, although the above
description refers to the communications between MCUs as
being implemented on a serial communications link 212, the
communications link may be any means for communicating
information between the MCUs to coordinate and manage
the individual data storage devices to create a coordinated
array of reliable data storage devices. Therefore,
communications between MCUs can be made across any type of bus,
or by wireless communications such as infrared or RF
transmission. As another example, the serial communications
link 212 can be used to transfer data blocks between nodes if
desired, which may be useful if the host bus 207 is heavily
loaded with other data traffic and a rebuild operation is
underway.

Accordingly, it is to be understood that the invention is
not to be limited by the specific illustrated embodiment, but
only by the scope of the appended claims.

We claim:

1. A distributed storage array system configured to be
coupled to a host computer having a host bus, comprising:

a plurality of data storage devices for storing and
retrieving data in a selected sequence;

a communications link; and

a plurality of modular control units, each configured to
communicate with the host computer directly over the
host bus, each modular control unit being coupled to at
least one corresponding data storage device and to each
other modular control unit by said communications
link, wherein:

at least one modular control unit includes a receiver for
receiving input/output requests from the host computer
directly over the host bus for determining a next data
storage device of a sequence of data storage devices
involved in responding to a pending one of the received
input/output requests; and

each modular control unit includes a receiver for
receiving configuration information from another modular
control unit over the communications link separately
from communications over the host bus.

2. The distributed storage array system of claim 1 wherein
said configuration information comprises control
information for controlling said modular control units.

3. The distributed storage array system of claim 1 wherein
said configuration information comprises configuration table
information.

4. The distributed storage array system of claim 1 wherein
said configuration information comprises a table of logical
system information.

5. The distributed storage array system of claim 1 wherein
said configuration information comprises information
defining an array of said plurality of data storage devices.

6. The distributed storage array system of claim 1 wherein
said configuration information comprises redundancy group
information.
7. The distributed storage array system of claim 1 wherein
said configuration information comprises logical volume
information.

8. The distributed storage array system of claim 1 wherein
said configuration information comprises logical disk
information.

9. The distributed storage array system of claim 1 wherein
said configuration information comprises information for
transferring over said communications link responsibility
for responding to the pending one of the received input/
output requests to the modular control units corresponding
to the sequence of involved data storage devices.

10. The distributed storage array system of claim 1
wherein said configuration information comprises a data
structure configuration information for each of said modular
control units that is communicated from each of said
modular control units to another of said modular control units.

11. The distributed storage array system of claim 10
wherein said modular control units each further comprise
means for updating its own configuration data structure and
sending a change configuration message to another modular
control unit.

12. The distributed storage array system of claim 10
wherein said data storage devices further comprise means
for updating its own configuration data structure if said
modular control unit receives changed configuration data
and sending a change configuration message to another
modular control unit.

13. The distributed storage array system of claim 1
wherein said configuration information comprises
time-stamp configuration data.

14. The distributed storage array system of claim 1
wherein said modular control units each further comprise
means for detecting a failure of another modular control
unit.

15. The distributed storage array system of claim 14
wherein said configuration information comprises an
operational message provided from each modular control unit,
and said means for detecting a failure of another modular
control unit comprises means for detecting an absence of
said operational message.

16. The distributed storage array system of claim 11
wherein said configuration information comprises a logical
volume to physical volume location map.

17. The distributed storage array system of claim 11
wherein said configuration table information comprises a
redundancy group to physical volume location map.

18. The distributed storage array system of claim 11
wherein said configuration information comprises a logical
data set to physical volume location map.

19. A distributed storage array system configured to be
coupled to a host computer having a host bus, comprising:

a plurality of data storage devices for storing and
retrieving data in a selected sequence;

a communications link;

a plurality of modular control units, each configured to
communicate with the host computer directly over the
host bus, each modular control unit being coupled to at
least one corresponding data storage device and to each
other modular control unit by said communications
link, wherein:

at least one modular control unit includes a receiver for
receiving input/output requests from the host computer
directly over the host bus for determining a next data
storage device of a sequence of data storage devices
involved in responding to a pending one of the received
input/output requests; and

each modular control unit includes a receiver for
receiving configuration information from another modular
control unit over the communications link separately
from communications over the host bus; and

a cache for caching at least some I/O transactions handled
by said modular control units.

20. The distributed storage array system of claim 19
wherein said at least one modular control unit including the
receiver for receiving the input/output requests further
comprises means for generating a write complete message to the
host computer when data to be written by said modular
control unit is written to said cache.

21. The distributed storage array system of claim 19
wherein said cache is sized to enable said modular control
units to process a transfer without logically disconnecting
from the host computer.

22. The distributed storage array system of claim 19
wherein the at least selected ones of said plurality of
modular control units includes a RAM and wherein said
[illegible] transactions comprises the [illegible]
of said plurality of modular control units
[illegible] associated modular control units.

23. The distributed storage array system of claim 22
wherein the at least selected ones of said plurality of
modular control units including the RAM comprises each of
said plurality of modular control units.

24. The distributed storage array system of claim 19
wherein one of said modular control units is dedicated to
caching.

25. A distributed storage array system configured to be
coupled to a host computer having a host bus, comprising:

a plurality of data storage devices for storing and
retrieving data in a selected sequence;

a communications link; and

a plurality of modular control units, each configured to
communicate with the host computer directly over the
host bus, each modular control unit being coupled to at
least one corresponding data storage device and to each
other modular control unit by said communications
link, wherein:

at least one modular control unit includes a receiver for
receiving input/output requests from the host computer
directly over the host bus for determining a next data
storage device of a sequence of data storage devices
involved in responding to a pending one of the received
input/output requests; and

each modular control unit includes a receiver for
receiving information from another modular control unit
over the communications link separately from
communications over the host bus.

26. The distributed array system of claim 25 wherein at
least a portion of said information received by each modular
control unit comprises configuration information.

27. The distributed array system of claim 25 wherein at
least a portion of said information received by each modular
control unit comprises control information.

28. The distributed array system of claim 25 wherein at
least a portion of said information received by each modular
control unit comprises data.

29. The distributed array system of claim 25 wherein at
least a portion of said information received by each modular
control unit comprises redundancy data.

* * * * *